BTS 510 Lab 8

set.seed(12345)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Stat2Data)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Assess models for (multi)collinearity among predictors
  • Conduct outlier analyses to determine extreme and/or problematic cases

2 Data

  • FirstYearGPA data from the Stat2Data package: n = 219 subjects
    • GPA: First-year college GPA on a 0.0 to 4.0 scale
    • HSGPA: High school GPA on a 0.0 to 4.0 scale
    • SATV: Verbal/critical reading SAT score
    • SATM: Math SAT score
    • Male: 1 = male, 0 = female
    • HU: Number of credit hours earned in humanities courses in high school
    • SS: Number of credit hours earned in social science courses in high school
    • FirstGen: 1 = student is the first in her or his family to attend college, 0 = otherwise
    • White: 1 = white students, 0 = others
    • CollegeBound: 1 = attended a high school where at least 50% of students intended to go on to college, 0 = otherwise

3 Research question(s)

  • How do all these variables impact first-year GPA (GPA)?
    • Are there any problems with collinearity among the predictors?
    • Are there problematic cases that are influencing the results?

4 Tasks

  1. Run a linear regression model predicting GPA from all other variables in the dataset.
library(Stat2Data)
data(FirstYearGPA)
FirstYearGPA <- FirstYearGPA %>% 
  mutate(ID = row_number())
head(FirstYearGPA)
   GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound ID
1 3.06  3.83  680  770    1  3.0  9.0        1     1            1  1
2 4.15  4.00  740  720    0  9.0  3.0        0     1            1  2
3 3.41  3.70  640  570    0 16.0 13.0        0     0            1  3
4 3.21  3.51  740  700    0 22.0  0.0        0     1            1  4
5 3.48  3.83  610  610    0 30.5  1.5        0     1            1  5
6 2.95  3.25  600  570    0 18.0  3.0        0     1            1  6
m1 <- lm(data = FirstYearGPA, GPA ~ . - ID)
summary(m1)

Call:
lm(formula = GPA ~ . - ID, data = FirstYearGPA)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07412 -0.25827  0.05384  0.27675  0.85761 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5268983  0.3487584   1.511  0.13235    
HSGPA         0.4932945  0.0745553   6.616 3.03e-10 ***
SATV          0.0005919  0.0003945   1.501  0.13498    
SATM          0.0000847  0.0004447   0.190  0.84912    
Male          0.0482478  0.0570277   0.846  0.39850    
HU            0.0161874  0.0039723   4.075 6.53e-05 ***
SS            0.0073370  0.0055635   1.319  0.18869    
FirstGen     -0.0743417  0.0887490  -0.838  0.40318    
White         0.1962316  0.0700182   2.803  0.00555 ** 
CollegeBound  0.0214530  0.1003350   0.214  0.83090    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3834 on 209 degrees of freedom
Multiple R-squared:  0.3496,    Adjusted R-squared:  0.3216 
F-statistic: 12.48 on 9 and 209 DF,  p-value: 8.674e-16
  2. Some of those variables seem like they could be strongly related, which could cause problems for the model. Check whether collinearity is an issue for this model. To check this, look at:
  • Correlations among the variables
  • Variance inflation factors (VIFs)
cor(FirstYearGPA)
                     GPA       HSGPA        SATV         SATM        Male
GPA           1.00000000  0.44688735  0.30431137  0.194343851  0.05284917
HSGPA         0.44688735  1.00000000  0.21032124  0.152839634 -0.09031714
SATV          0.30431137  0.21032124  1.00000000  0.526943819  0.14555703
SATM          0.19434385  0.15283963  0.52694382  1.000000000  0.37099167
Male          0.05284917 -0.09031714  0.14555703  0.370991668  1.00000000
HU            0.31465575  0.11603117  0.09874856 -0.009601549 -0.01884386
SS           -0.00356909 -0.01725443 -0.02646987 -0.087839974  0.03507603
FirstGen     -0.15657732  0.06418575 -0.25657713 -0.177387395 -0.07610526
White         0.28177214  0.04604668  0.36823365  0.259465227  0.07696022
CollegeBound -0.06302497 -0.20003903  0.06484473  0.039322063  0.09981773
ID           -0.18693938 -0.24169420 -0.09436226 -0.047312698  0.10121274
                       HU          SS     FirstGen       White CollegeBound
GPA           0.314655754 -0.00356909 -0.156577322  0.28177214  -0.06302497
HSGPA         0.116031169 -0.01725443  0.064185751  0.04604668  -0.20003903
SATV          0.098748556 -0.02646987 -0.256577125  0.36823365   0.06484473
SATM         -0.009601549 -0.08783997 -0.177387395  0.25946523   0.03932206
Male         -0.018843863  0.03507603 -0.076105261  0.07696022   0.09981773
HU            1.000000000 -0.30660787 -0.212565615  0.12593391  -0.02997230
SS           -0.306607866  1.00000000 -0.023663260  0.01673417  -0.03170649
FirstGen     -0.212565615 -0.02366326  1.000000000 -0.23790392  -0.11051100
White         0.125933908  0.01673417 -0.237903920  1.00000000  -0.02391156
CollegeBound -0.029972303 -0.03170649 -0.110511005 -0.02391156   1.00000000
ID           -0.091577232 -0.12965617  0.002725596 -0.01524922  -0.01025733
                       ID
GPA          -0.186939380
HSGPA        -0.241694202
SATV         -0.094362258
SATM         -0.047312698
Male          0.101212743
HU           -0.091577232
SS           -0.129656169
FirstGen      0.002725596
White        -0.015249216
CollegeBound -0.010257334
ID            1.000000000
library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
vif(m1)
       HSGPA         SATV         SATM         Male           HU           SS 
    1.158078     1.605182     1.660002     1.205737     1.221530     1.147867 
    FirstGen        White CollegeBound 
    1.186760     1.212049     1.073989 
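The VIFs above can be cross-checked against their definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A minimal sketch using HSGPA as the example (this refits from the raw data, so the added ID column is not involved):

```r
library(Stat2Data)
data(FirstYearGPA)  # raw data, without the ID column added earlier

# R^2 from regressing HSGPA on the other eight predictors (everything but GPA)
r2_j <- summary(lm(HSGPA ~ . - GPA, data = FirstYearGPA))$r.squared
vif_hsgpa <- 1 / (1 - r2_j)
vif_hsgpa  # should agree with car::vif(m1)["HSGPA"] above, about 1.16
```

A large R²_j means predictor j is largely redundant with the others, which is exactly what inflates its coefficient's standard error.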
  3. Based on the correlations between each predictor and the outcome (above), are there predictors that you think should be significant but aren’t? Why do you think they aren’t? (This is a bit of a philosophical question.)

  4. Are there extreme cases that we might be concerned about? Check for extreme values in terms of:

  • Predictors
  • Predicted values
  • Changes to predicted values
library(broom)
m1_aug <- augment(m1)
m1_aug
# A tibble: 219 × 17
     GPA HSGPA  SATV  SATM  Male    HU    SS FirstGen White CollegeBound    ID
   <dbl> <dbl> <int> <int> <int> <dbl> <dbl>    <int> <int>        <int> <int>
 1  3.06  3.83   680   770     1   3     9          1     1            1     1
 2  4.15  4      740   720     0   9     3          0     1            1     2
 3  3.41  3.7    640   570     0  16    13          0     0            1     3
 4  3.21  3.51   740   700     0  22     0          0     1            1     4
 5  3.48  3.83   610   610     0  30.5   1.5        0     1            1     5
 6  2.95  3.25   600   570     0  18     3          0     1            1     6
 7  3.6   3.79   710   630     0   5    19          0     1            1     7
 8  2.87  3.6    390   570     0  10     0          0     0            0     8
 9  3.67  3.36   630   560     0   8.5  15.5        0     1            1     9
10  3.49  3.7    680   670     0  16    12          0     1            1    10
# ℹ 209 more rows
# ℹ 6 more variables: .fitted <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>
ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .hat)) +
  geom_point() +
  geom_hline(yintercept = 3*(9+1)/219, 
             color = "red",
             linetype = "dashed") +
  geom_hline(yintercept = 2*(9+1)/219,
             color = "blue") +
  geom_text(aes(label=ifelse((.hat > 2*(9+1)/219), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)
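The leverage cutoffs in the plot above are hardcoded as `2*(9+1)/219` and `3*(9+1)/219`. They can instead be computed from the fitted model, so the thresholds stay correct if predictors are added or dropped. A sketch (refitting the model here so the block stands alone; dropping ID from the formula gives the same fit):

```r
library(Stat2Data)
data(FirstYearGPA)
m1 <- lm(GPA ~ ., data = FirstYearGPA)

p <- length(coef(m1))   # k + 1 = 10 coefficients, intercept included
n <- nobs(m1)           # 219 cases
cutoff2 <- 2 * p / n    # twice the average leverage (blue line)
cutoff3 <- 3 * p / n    # stricter three-times version (red dashed line)
round(c(cutoff2, cutoff3), 4)

which(hatvalues(m1) > cutoff2)  # flagged rows, without needing broom
```

`hatvalues()` returns the same quantities as broom's `.hat` column, so the flagged rows should match the leverage table below.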

library(car)
m1_stresid <- FirstYearGPA %>% mutate(esresid = rstudent(m1))
head(m1_stresid)
   GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound ID    esresid
1 3.06  3.83  680  770    1  3.0  9.0        1     1            1  1 -0.3523421
2 4.15  4.00  740  720    0  9.0  3.0        0     1            1  2  2.0524744
3 3.41  3.70  640  570    0 16.0 13.0        0     0            1  3  0.6808208
4 3.21  3.51  740  700    0 22.0  0.0        0     1            1  4 -0.3167075
5 3.48  3.83  610  610    0 30.5  1.5        0     1            1  5 -0.1895846
6 2.95  3.25  600  570    0 18.0  3.0        0     1            1  6 -0.3012032
ggplot(data = m1_stresid,
       aes(x = c(1:nrow(m1_stresid)), 
           y = esresid)) +
  geom_point() +
  geom_hline(yintercept = 2, 
             color = "blue",
             linetype = "dashed") +
  geom_hline(yintercept = -2,
             color = "blue",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((esresid > 2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2) +
  geom_text(aes(label=ifelse((esresid < -2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .cooksd)) +
  geom_point() +
  geom_hline(yintercept = 1, 
             color = "red",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((.cooksd > 1), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

  5. Summarize your findings about collinearity and outliers. Use plain language.
  • Leverage
m1_aug %>% filter(.hat > 2*(9+1)/219) %>% select(ID)
# A tibble: 13 × 1
      ID
   <int>
 1     8
 2    43
 3    69
 4    73
 5   104
 6   108
 7   121
 8   143
 9   161
10   179
11   184
12   192
13   203
  • Discrepancy
m1_stresid %>% filter(esresid > 2) %>% select(ID)
   ID
1   2
2  79
3 171
m1_stresid %>% filter(esresid < -2) %>% select(ID)
   ID
1  39
2  48
3 145
4 172
  • Influence
m1_aug %>% filter(.cooksd > 1) %>% select(ID)
# A tibble: 0 × 1
# ℹ 1 variable: ID <int>
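A cutoff of 1 for Cook’s D is the most conservative convention; some texts instead flag anything above 4/n. As a cross-check without broom (the 4/n rule is an alternative convention, not one of this lab’s criteria):

```r
library(Stat2Data)
data(FirstYearGPA)
m1 <- lm(GPA ~ ., data = FirstYearGPA)

max(cooks.distance(m1))                 # largest Cook's D; well under 1 here
sum(cooks.distance(m1) > 4 / nobs(m1))  # count flagged by the looser 4/n rule
```

`cooks.distance()` matches broom's `.cooksd` column, so the first line confirms that no case crosses the cutoff of 1.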

13 cases had high leverage, as measured by the hat values with a cutoff of 2(k+1)/n. An additional 7 cases were highly discrepant, as measured by the externally studentized residuals with a cutoff of ±2.

There were no cases with high influence as measured by Cook’s D, using a cutoff of 1. This indicates that no single case is substantially changing the model’s coefficients or predicted values.